The dataset contains the following features:
Note: You can assume that the data is collected in the year 2016.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.linear_model import LinearRegression
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
from scipy.spatial.distance import pdist
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
# Function to create barplots that indicate the percentage for each category
def bar_perc(data, column):
    total = len(data[column])  # Number of observations in the column
    plt.figure(figsize = (15, 5))
    # plt.xticks(rotation = 45)
    ax = sns.countplot(x = column, data = data, palette = 'Paired')
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # Percentage of each class
        x = p.get_x() + p.get_width() / 2 - 0.05  # X position of the annotation
        y = p.get_y() + p.get_height()  # Y position of the annotation
        ax.annotate(percentage, (x, y), size = 12)  # Annotate the percentage
    plt.show()
# This function takes a numerical variable as input and plots a boxplot and a histogram for it
# This helps us write faster and cleaner code
def histogram_boxplot(feature, figsize = (15,10), bins = None):
    """ Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (15,10))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows = 2,  # Number of rows of the subplot grid
                                           sharex = True,  # The X-axis will be shared among the subplots
                                           gridspec_kw = {"height_ratios": (.25, .75)},
                                           figsize = figsize
                                           )
    # Creating the subplots
    # Boxplot will be created and the mean value of the column will be indicated with a marker
    sns.boxplot(x = feature, ax = ax_box2, showmeans = True, color = 'red')
    # For the histogram
    sns.histplot(feature, kde = True, ax = ax_hist2, bins = bins if bins else 'auto')
    ax_hist2.axvline(np.mean(feature), color = 'g', linestyle = '--')  # Add the mean to the histogram
    ax_hist2.axvline(np.median(feature), color = 'black', linestyle = '-')  # Add the median to the histogram
    # Calculate the outlier boundaries using the 1.5 * IQR rule
    iqr = np.percentile(feature, 75) - np.percentile(feature, 25)
    lower_bound = np.percentile(feature, 25) - 1.5 * iqr
    upper_bound = np.percentile(feature, 75) + 1.5 * iqr
    # Get the smallest and largest values that still fall inside the whiskers
    minimum = np.min(feature[feature >= lower_bound])
    maximum = np.max(feature[feature <= upper_bound])
    print('lower_bound: ', minimum)
    print('upper_bound', maximum)
#__________________
# read the datafile
pd.set_option('display.max_columns', None)
df_initial = pd.read_csv('marketing_campaign+%284%29.csv')
print('Dataframe dimensions:', df_initial.shape)
if df_initial.duplicated().any():
    print('hasDuplicates:', df_initial.duplicated().sum())
else:
    print('hasDuplicates:', False)
#____________________________________________________________
# gives some info on column types and the number of null values
tab_info = pd.concat([pd.DataFrame(df_initial.dtypes).T.rename(index={0:'column type'}),
pd.DataFrame(df_initial.isnull().sum()).T.rename(index={0:'null values (nb)'}),
pd.DataFrame(df_initial.isnull().sum()/df_initial.shape[0]*100).T.rename(index={0:'null values (%)'})])
display(tab_info)
#__________________
# show first lines
display(df_initial[:5])
Dataframe dimensions: (2240, 27)
hasDuplicates: False
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| column type | int64 | int64 | object | object | float64 | int64 | int64 | object | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 | int64 |
| null values (nb) | 0 | 0 | 0 | 0 | 24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| null values (%) | 0.0 | 0.0 | 0.0 | 0.0 | 1.071429 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
df_initial.describe()
| ID | Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2240.000000 | 2240.000000 | 2216.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 | 2240.000000 |
| mean | 5592.159821 | 1968.805804 | 52247.251354 | 0.444196 | 0.506250 | 49.109375 | 303.935714 | 26.302232 | 166.950000 | 37.525446 | 27.062946 | 44.021875 | 2.325000 | 4.084821 | 2.662054 | 5.790179 | 5.316518 | 0.072768 | 0.074554 | 0.072768 | 0.064286 | 0.012946 | 0.009375 | 0.149107 |
| std | 3246.662198 | 11.984069 | 25173.076661 | 0.538398 | 0.544538 | 28.962453 | 336.597393 | 39.773434 | 225.715373 | 54.628979 | 41.280498 | 52.167439 | 1.932238 | 2.778714 | 2.923101 | 3.250958 | 2.426645 | 0.259813 | 0.262728 | 0.259813 | 0.245316 | 0.113069 | 0.096391 | 0.356274 |
| min | 0.000000 | 1893.000000 | 1730.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2828.250000 | 1959.000000 | 35303.000000 | 0.000000 | 0.000000 | 24.000000 | 23.750000 | 1.000000 | 16.000000 | 3.000000 | 1.000000 | 9.000000 | 1.000000 | 2.000000 | 0.000000 | 3.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 5458.500000 | 1970.000000 | 51381.500000 | 0.000000 | 0.000000 | 49.000000 | 173.500000 | 8.000000 | 67.000000 | 12.000000 | 8.000000 | 24.000000 | 2.000000 | 4.000000 | 2.000000 | 5.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 8427.750000 | 1977.000000 | 68522.000000 | 1.000000 | 1.000000 | 74.000000 | 504.250000 | 33.000000 | 232.000000 | 50.000000 | 33.000000 | 56.000000 | 3.000000 | 6.000000 | 4.000000 | 8.000000 | 7.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 11191.000000 | 1996.000000 | 666666.000000 | 2.000000 | 2.000000 | 99.000000 | 1493.000000 | 199.000000 | 1725.000000 | 259.000000 | 263.000000 | 362.000000 | 15.000000 | 27.000000 | 28.000000 | 13.000000 | 20.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
# To remove the NA values
df = df_initial.dropna()
# Number of columns that contain null values in df_initial
num_null_columns = df_initial.isnull().any().sum()
num_rows_dropped = len(df_initial) - len(df)
num_remaining_rows = len(df)
# Determine pluralization for columns
column_plural = "column" if num_null_columns == 1 else "columns"
# Print the results
print(f'Dropped {num_rows_dropped} rows from {num_null_columns} {column_plural}. The total remaining rows are {num_remaining_rows:,}.')
Dropped 24 rows from 1 column. The total remaining rows are 2,216.
# Convert Dt_Customer column to datetime
df_initial['Dt_Customer'] = pd.to_datetime(df_initial['Dt_Customer'], format='%d-%m-%Y', errors='coerce')
df_initial[df_initial.isnull().any(axis=1)]
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 1994 | 1983 | Graduation | Married | NaN | 1 | 0 | 2013-11-15 | 11 | 5 | 5 | 6 | 0 | 2 | 1 | 1 | 1 | 0 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 27 | 5255 | 1986 | Graduation | Single | NaN | 1 | 0 | 2013-02-20 | 19 | 5 | 1 | 3 | 3 | 263 | 362 | 0 | 27 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 43 | 7281 | 1959 | PhD | Single | NaN | 0 | 0 | 2013-11-05 | 80 | 81 | 11 | 50 | 3 | 2 | 39 | 1 | 1 | 3 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 48 | 7244 | 1951 | Graduation | Single | NaN | 2 | 1 | 2014-01-01 | 96 | 48 | 5 | 48 | 6 | 10 | 7 | 3 | 2 | 1 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 58 | 8557 | 1982 | Graduation | Single | NaN | 1 | 0 | 2013-06-17 | 57 | 11 | 3 | 22 | 2 | 2 | 6 | 2 | 2 | 0 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 71 | 10629 | 1973 | 2n Cycle | Married | NaN | 1 | 0 | 2012-09-14 | 25 | 25 | 3 | 43 | 17 | 4 | 17 | 3 | 3 | 0 | 3 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 90 | 8996 | 1957 | PhD | Married | NaN | 2 | 1 | 2012-11-19 | 4 | 230 | 42 | 192 | 49 | 37 | 53 | 12 | 7 | 2 | 8 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 91 | 9235 | 1957 | Graduation | Single | NaN | 1 | 1 | 2014-05-27 | 45 | 7 | 0 | 8 | 2 | 0 | 1 | 1 | 1 | 0 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 92 | 5798 | 1973 | Master | Together | NaN | 0 | 0 | 2013-11-23 | 87 | 445 | 37 | 359 | 98 | 28 | 18 | 1 | 2 | 4 | 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 128 | 8268 | 1961 | PhD | Married | NaN | 0 | 1 | 2013-07-11 | 23 | 352 | 0 | 27 | 10 | 0 | 15 | 3 | 6 | 1 | 7 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 133 | 1295 | 1963 | Graduation | Married | NaN | 0 | 1 | 2013-08-11 | 96 | 231 | 65 | 196 | 38 | 71 | 124 | 1 | 6 | 5 | 7 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 312 | 2437 | 1989 | Graduation | Married | NaN | 0 | 0 | 2013-06-03 | 69 | 861 | 138 | 461 | 60 | 30 | 61 | 1 | 6 | 5 | 12 | 3 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 319 | 2863 | 1970 | Graduation | Single | NaN | 1 | 2 | 2013-08-23 | 67 | 738 | 20 | 172 | 52 | 50 | 20 | 6 | 2 | 3 | 10 | 7 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1379 | 10475 | 1970 | Master | Together | NaN | 0 | 1 | 2013-04-01 | 39 | 187 | 5 | 65 | 26 | 20 | 14 | 2 | 4 | 2 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1382 | 2902 | 1958 | Graduation | Together | NaN | 1 | 1 | 2012-09-03 | 87 | 19 | 4 | 12 | 2 | 2 | 6 | 1 | 1 | 0 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1383 | 4345 | 1964 | 2n Cycle | Single | NaN | 1 | 1 | 2014-01-12 | 49 | 5 | 1 | 9 | 2 | 0 | 4 | 1 | 1 | 0 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1386 | 3769 | 1972 | PhD | Together | NaN | 1 | 0 | 2014-03-02 | 17 | 25 | 1 | 13 | 0 | 0 | 3 | 1 | 1 | 0 | 3 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2059 | 7187 | 1969 | Master | Together | NaN | 1 | 1 | 2013-05-18 | 52 | 375 | 42 | 48 | 94 | 66 | 96 | 7 | 4 | 10 | 4 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2061 | 1612 | 1981 | PhD | Single | NaN | 1 | 0 | 2013-05-31 | 82 | 23 | 0 | 15 | 0 | 2 | 7 | 2 | 3 | 0 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2078 | 5079 | 1971 | Graduation | Married | NaN | 1 | 1 | 2013-03-03 | 82 | 71 | 1 | 16 | 0 | 0 | 9 | 4 | 2 | 1 | 3 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2079 | 10339 | 1954 | Master | Together | NaN | 0 | 1 | 2013-06-23 | 83 | 161 | 0 | 22 | 0 | 0 | 24 | 2 | 4 | 1 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2081 | 3117 | 1955 | Graduation | Single | NaN | 0 | 1 | 2013-10-18 | 95 | 264 | 0 | 21 | 12 | 6 | 147 | 3 | 6 | 1 | 5 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2084 | 5250 | 1943 | Master | Widow | NaN | 0 | 0 | 2013-10-30 | 75 | 532 | 126 | 490 | 164 | 126 | 126 | 1 | 5 | 5 | 11 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2228 | 8720 | 1978 | 2n Cycle | Together | NaN | 0 | 0 | 2012-08-12 | 53 | 32 | 2 | 1607 | 12 | 4 | 22 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
Questions:
def describe_dataframe(df, include_number=True):
    if include_number:
        df_desc = df.describe(include='number').T
    else:
        df_desc = df.describe(exclude='number').T
    df_col = df_desc.index.tolist()
    for column in df_col:
        print(df[column].value_counts())
        print('-' * 50)
df.describe(exclude = 'number').T
| count | unique | top | freq | |
|---|---|---|---|---|
| Education | 2216 | 5 | Graduation | 1116 |
| Marital_Status | 2216 | 8 | Married | 857 |
| Dt_Customer | 2216 | 662 | 31-08-2012 | 12 |
# Replace 2n Cycle with Master
df.loc[df['Education'] == '2n Cycle', 'Education'] = 'Master'
# Group the rare marital statuses ('Alone', 'Absurd', 'YOLO') under Single
df.loc[df['Marital_Status'].isin(['Alone', 'Absurd', 'YOLO']), 'Marital_Status'] = 'Single'
df.describe(include = 'number').T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 2216.0 | 5588.353339 | 3249.376275 | 0.0 | 2814.75 | 5458.5 | 8421.75 | 11191.0 |
| Year_Birth | 2216.0 | 1968.820397 | 11.985554 | 1893.0 | 1959.00 | 1970.0 | 1977.00 | 1996.0 |
| Income | 2216.0 | 52247.251354 | 25173.076661 | 1730.0 | 35303.00 | 51381.5 | 68522.00 | 666666.0 |
| Kidhome | 2216.0 | 0.441787 | 0.536896 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Teenhome | 2216.0 | 0.505415 | 0.544181 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
| Recency | 2216.0 | 49.012635 | 28.948352 | 0.0 | 24.00 | 49.0 | 74.00 | 99.0 |
| MntWines | 2216.0 | 305.091606 | 337.327920 | 0.0 | 24.00 | 174.5 | 505.00 | 1493.0 |
| MntFruits | 2216.0 | 26.356047 | 39.793917 | 0.0 | 2.00 | 8.0 | 33.00 | 199.0 |
| MntMeatProducts | 2216.0 | 166.995939 | 224.283273 | 0.0 | 16.00 | 68.0 | 232.25 | 1725.0 |
| MntFishProducts | 2216.0 | 37.637635 | 54.752082 | 0.0 | 3.00 | 12.0 | 50.00 | 259.0 |
| MntSweetProducts | 2216.0 | 27.028881 | 41.072046 | 0.0 | 1.00 | 8.0 | 33.00 | 262.0 |
| MntGoldProds | 2216.0 | 43.965253 | 51.815414 | 0.0 | 9.00 | 24.5 | 56.00 | 321.0 |
| NumDealsPurchases | 2216.0 | 2.323556 | 1.923716 | 0.0 | 1.00 | 2.0 | 3.00 | 15.0 |
| NumWebPurchases | 2216.0 | 4.085289 | 2.740951 | 0.0 | 2.00 | 4.0 | 6.00 | 27.0 |
| NumCatalogPurchases | 2216.0 | 2.671029 | 2.926734 | 0.0 | 0.00 | 2.0 | 4.00 | 28.0 |
| NumStorePurchases | 2216.0 | 5.800993 | 3.250785 | 0.0 | 3.00 | 5.0 | 8.00 | 13.0 |
| NumWebVisitsMonth | 2216.0 | 5.319043 | 2.425359 | 0.0 | 3.00 | 6.0 | 7.00 | 20.0 |
| AcceptedCmp3 | 2216.0 | 0.073556 | 0.261106 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp4 | 2216.0 | 0.074007 | 0.261842 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp5 | 2216.0 | 0.073105 | 0.260367 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp1 | 2216.0 | 0.064079 | 0.244950 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| AcceptedCmp2 | 2216.0 | 0.013087 | 0.113672 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Complain | 2216.0 | 0.009477 | 0.096907 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Response | 2216.0 | 0.150271 | 0.357417 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
Univariate analysis explores each variable in a data set separately. It looks at the range of values as well as their central tendency, and can be applied to both numerical and categorical variables.
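As a minimal sketch (a toy series, not the campaign data), the range and central tendency of a numerical variable can be read off directly:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 10])  # toy numerical variable
# Range of values
print(s.min(), s.max())  # 1 10
# Central tendency: mean, median, mode
print(s.mean(), s.median(), s.mode().iloc[0])  # 3.6 2.0 2
```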
Leading Questions:
cat_cols = df.describe(exclude='number').T.index.tolist()
num_cols = df.describe(include='number').T.index.tolist()
cat_cols
['Education', 'Marital_Status', 'Dt_Customer']
bar_perc(df, 'Education')
bar_perc(df, 'Kidhome')
bar_perc(df, 'Teenhome')
bar_perc(df, 'Complain')
bar_perc(df, 'Marital_Status')
num_cols
['ID', 'Year_Birth', 'Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Complain', 'Response']
histogram_boxplot(df.Income, figsize = (15,10), bins = None)
lower_bound: 1730.0
upper_bound 113734.0
df = df[df['Income'] < 200000]
histogram_boxplot(df.Income, figsize = (15,10), bins = None)
lower_bound: 1730.0
upper_bound 113734.0
histogram_boxplot(df.Year_Birth, figsize = (15,10), bins = None)
lower_bound: 1940
upper_bound 1996
histogram_boxplot(df.NumWebPurchases, figsize = (15,10), bins = None)
lower_bound: 0
upper_bound 11
histogram_boxplot(df.Year_Birth, figsize = (15,10), bins = None)
lower_bound: 1940
upper_bound 1996
histogram_boxplot(df.NumCatalogPurchases, figsize = (15,10), bins = None)
lower_bound: 0
upper_bound 10
histogram_boxplot(df.NumStorePurchases, figsize = (15,10), bins = None)
lower_bound: 0
upper_bound 13
histogram_boxplot(df.NumWebVisitsMonth, figsize = (15,10), bins = None)
lower_bound: 0
upper_bound 13
# describe_dataframe(df, include_number=False)
plt.figure(figsize = (16, 16))
sns.heatmap(df.corr(numeric_only = True), annot = True, fmt = '0.2f')  # Shows correlation between the numeric variables
plt.show()
# Relationship between Education and Income
plt.figure(figsize=(15,7))
sns.boxplot(x = "Education", y = "Income", data = df, palette = 'PuBu')
plt.xticks(rotation = 60)
plt.show()
# Relationship between Marital_Status and Income
plt.figure(figsize=(15,7))
sns.boxplot(x = "Marital_Status", y = "Income", data = df, palette = 'PuBu')
plt.xticks(rotation = 60)
plt.show()
In this section, we will first prepare our dataset for analysis.
Think About It:
df["Age"] = 2016 - df.Year_Birth
df.drop(df[df["Age"] > 115].index, inplace=True)
histogram_boxplot(df.Age, figsize = (15,10), bins = None)
lower_bound: 20
upper_bound 76
df["Kids"] = df.Teenhome + df.Kidhome
df.Marital_Status.unique().tolist()
['Single', 'Together', 'Married', 'Divorced', 'Widow']
df["Marital_Status"].replace(["Married", "Together"], "Relationship", inplace=True)
df["Marital_Status"].replace(["Divorced", "Widow"], "Single", inplace=True)
df.Marital_Status.value_counts()
Relationship    1428
Single           784
Name: Marital_Status, dtype: int64
df["Status"] = df["Marital_Status"].replace({"Single": 1, "Relationship": 2})
df["Family_Size"] = df["Status"] + df["Kids"]
df["Expenses"] = df[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].sum(axis=1)
df["NumTotalPurchases"] = df[['NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases']].sum(axis=1)
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%d-%m-%Y")
df["Dt_Customer"].min()
Timestamp('2012-07-30 00:00:00')
df["Dt_Customer"].max()
Timestamp('2014-06-29 00:00:00')
# Assigning date to the day variable
df["day"] = "01-01-2015"
# Converting the variable day to Python datetime object
df["day"] = pd.to_datetime(df.day)
df["Engaged_in_days"] = (df["day"] - df["Dt_Customer"]).dt.days
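The tenure computation relies on datetime subtraction: `Timestamp - DatetimeIndex` yields a `TimedeltaIndex`, whose `.days` gives integer day counts. A small sketch using the min/max join dates seen in this data and the 2015-01-01 reference date:

```python
import pandas as pd

day = pd.Timestamp("2015-01-01")  # reference date used above
joined = pd.to_datetime(["30-07-2012", "29-06-2014"], format="%d-%m-%Y")
print((day - joined).days.tolist())  # [885, 186]
```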
df["TotalAcceptedCmp"] = df[["AcceptedCmp1", "AcceptedCmp2", "AcceptedCmp3", "AcceptedCmp4", "AcceptedCmp5", "Response"]].sum(axis=1)
df['AmountPerPurchase'] = df['Expenses'] / df['NumTotalPurchases']
df['AmountPerPurchase'].max()
inf
df[df['NumTotalPurchases'] == 0].shape[0]
4
df.drop(df[df['NumTotalPurchases'] == 0].index, inplace=True)
histogram_boxplot(df.AmountPerPurchase, figsize = (15,10), bins = None)
lower_bound: 0.5333333333333333
upper_bound 98.91666666666667
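Dropping the four zero-purchase rows works here. An alternative that keeps those rows is to map the resulting `inf` to `NaN` so they can be imputed or dropped later; a sketch on toy data (column names mirror the ones above):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Expenses": [100, 50, 0], "NumTotalPurchases": [4, 0, 0]})
# 50 / 0 gives inf and 0 / 0 gives NaN in pandas; normalize both to NaN
toy["AmountPerPurchase"] = (toy["Expenses"] / toy["NumTotalPurchases"]).replace([np.inf, -np.inf], np.nan)
print(toy["AmountPerPurchase"].tolist())  # [25.0, nan, nan]
```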
What are the most important observations and insights from the data, based on the EDA and data preprocessing performed?
# Dropping all the irrelevant columns and storing the result in data_model
data_model = df.drop(
    columns=[
        "ID",
        "Year_Birth",
        "Dt_Customer",
        "day",
        "Complain",
        "Response",
        "AcceptedCmp1",
        "AcceptedCmp2",
        "AcceptedCmp3",
        "AcceptedCmp4",
        "AcceptedCmp5",
        "Marital_Status",
        "Status",
        "Kids",
        "Education",
        "Kidhome",
        "Teenhome",
        "Income",
        "Age",
        "Family_Size",
    ]
)
df.shape
(2208, 37)
data_model.shape
(2208, 17)
data_model.head()
| Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 1617 | 25 | 849 | 1 | 64.680000 |
| 1 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 27 | 6 | 299 | 0 | 4.500000 |
| 2 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 776 | 21 | 498 | 0 | 36.952381 |
| 3 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 | 53 | 8 | 325 | 0 | 6.625000 |
| 4 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 | 422 | 19 | 347 | 0 | 22.210526 |
from sklearn.preprocessing import StandardScaler
# Scaling the data and storing the output as a new DataFrame
scaler = StandardScaler()
numerical_columns = df.select_dtypes(include=[np.number])
data_scaled_all = pd.DataFrame(scaler.fit_transform(numerical_columns), columns = numerical_columns.columns)
data_scaled_all.head()
| ID | Year_Birth | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Age | Kids | Status | Family_Size | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.017526 | -1.018743 | 0.292925 | -0.823235 | -0.931270 | 0.311060 | 0.975863 | 1.550053 | 1.688262 | 2.451041 | 1.481762 | 0.850833 | 0.348987 | 1.425749 | 2.501580 | -0.560167 | 0.698394 | -0.282324 | -0.283258 | -0.280449 | -0.262168 | -0.115364 | -0.095607 | 2.372895 | 1.018743 | -1.266516 | -1.351719 | -1.761762 | 1.674434 | 1.317746 | 1.526655 | 0.617845 | 1.120445 |
| 1 | -1.049811 | -1.275141 | -0.261644 | 1.039166 | 0.906303 | -0.380080 | -0.874099 | -0.638273 | -0.719275 | -0.651903 | -0.634855 | -0.734941 | -0.170961 | -1.130379 | -0.572908 | -1.176741 | -0.132230 | -0.282324 | -0.283258 | -0.280449 | -0.262168 | -0.115364 | -0.095607 | -0.421426 | 1.275141 | 1.403143 | -1.351719 | 0.447193 | -0.965102 | -1.165726 | -1.190250 | -0.504032 | -0.986284 |
| 2 | -0.443690 | -0.335013 | 0.926537 | -0.823235 | -0.931270 | -0.794765 | 0.356244 | 0.569079 | -0.179808 | 1.337632 | -0.148276 | -0.038747 | -0.690910 | 1.425749 | -0.231298 | 1.289556 | -0.547542 | -0.282324 | -0.283258 | -0.280449 | -0.262168 | -0.115364 | -0.095607 | -0.421426 | 0.335013 | -1.266516 | 0.739799 | -0.657284 | 0.278302 | 0.794910 | -0.207224 | -0.504032 | 0.149781 |
| 3 | 0.185233 | 1.288844 | -1.187869 | 1.039166 | -0.931270 | -0.794765 | -0.874099 | -0.562814 | -0.656857 | -0.505882 | -0.586197 | -0.754280 | -0.170961 | -0.765218 | -0.914518 | -0.560167 | 0.283082 | -0.282324 | -0.283258 | -0.280449 | -0.262168 | -0.115364 | -0.095607 | -0.421426 | -1.288844 | 0.068313 | 0.739799 | 0.447193 | -0.921940 | -0.904308 | -1.061814 | -0.504032 | -0.911894 |
| 4 | -0.079155 | 1.032445 | 0.300213 | 1.039166 | -0.931270 | 1.555113 | -0.393821 | 0.418160 | -0.219934 | 0.151212 | -0.002303 | -0.560893 | 1.388885 | 0.330266 | 0.110311 | 0.056408 | -0.132230 | -0.282324 | -0.283258 | -0.280449 | -0.262168 | -0.115364 | -0.095607 | -0.421426 | -1.032445 | 0.068313 | 0.739799 | 0.447193 | -0.309368 | 0.533492 | -0.953138 | -0.504032 | -0.366289 |
# Scaling the data and storing the output as a new DataFrame
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data_model), columns = data_model.columns)
data_scaled.head()
| Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Expenses | NumTotalPurchases | Engaged_in_days | TotalAcceptedCmp | AmountPerPurchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.311060 | 0.975863 | 1.550053 | 1.688262 | 2.451041 | 1.481762 | 0.850833 | 0.348987 | 1.425749 | 2.501580 | -0.560167 | 0.698394 | 1.674434 | 1.317746 | 1.526655 | 0.617845 | 1.120445 |
| 1 | -0.380080 | -0.874099 | -0.638273 | -0.719275 | -0.651903 | -0.634855 | -0.734941 | -0.170961 | -1.130379 | -0.572908 | -1.176741 | -0.132230 | -0.965102 | -1.165726 | -1.190250 | -0.504032 | -0.986284 |
| 2 | -0.794765 | 0.356244 | 0.569079 | -0.179808 | 1.337632 | -0.148276 | -0.038747 | -0.690910 | 1.425749 | -0.231298 | 1.289556 | -0.547542 | 0.278302 | 0.794910 | -0.207224 | -0.504032 | 0.149781 |
| 3 | -0.794765 | -0.874099 | -0.562814 | -0.656857 | -0.505882 | -0.586197 | -0.754280 | -0.170961 | -0.765218 | -0.914518 | -0.560167 | 0.283082 | -0.921940 | -0.904308 | -1.061814 | -0.504032 | -0.911894 |
| 4 | 1.555113 | -0.393821 | 0.418160 | -0.219934 | 0.151212 | -0.002303 | -0.560893 | 1.388885 | 0.330266 | 0.110311 | 0.056408 | -0.132230 | -0.309368 | 0.533492 | -0.953138 | -0.504032 | -0.366289 |
from sklearn.manifold import TSNE
# t-SNE is a non-linear dimensionality reduction technique that focuses on preserving the local structure of the data
# Apply the t-SNE algorithm with random_state = 1
tsne = TSNE(n_components = 2, random_state = 1)
# Fit and transform t-SNE on the scaled data
data_tsne_all = tsne.fit_transform(data_scaled_all)
# Converting the embeddings to a dataframe
data_tsne_all = pd.DataFrame(data_tsne_all, columns = ['Component 1', 'Component 2'])
# Scatter plot for two components
plt.figure(figsize = (7, 7))
sns.scatterplot(x = data_tsne_all.iloc[:,0], y = data_tsne_all.iloc[:,1])
plt.show()
Observation: Some clear clusters appear when no columns are removed.
from sklearn.manifold import TSNE
# t-SNE is a non-linear dimensionality reduction technique that focuses on preserving the local structure of the data
# Apply the t-SNE algorithm with random_state = 1
tsne = TSNE(n_components = 2, random_state = 1)
# Fit and transform t-SNE on the scaled data
data_tsne = tsne.fit_transform(data_scaled)
# Converting the embeddings to a dataframe
data_tsne = pd.DataFrame(data_tsne, columns = ['Component 1', 'Component 2'])
# Scatter plot for two components
plt.figure(figsize = (7, 7))
sns.scatterplot(x = data_tsne.iloc[:,0], y = data_tsne.iloc[:,1])
plt.show()
Observation: With the demographic columns removed, the samples move closer together; the plot looks more like one big cluster than many distinct ones.
Think about it:
Answer: PCA can help reduce the collinearity between variables and provide more meaningful clustering insights.
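That claim is easy to check: the sample correlation between any two principal components is numerically zero. A sketch on synthetic collinear data (not the campaign data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
a = rng.normal(size=500)
# Three features, the first two strongly collinear
X = np.column_stack([a, a + 0.1 * rng.normal(size=500), rng.normal(size=500)])

Z = PCA(n_components=3, random_state=1).fit_transform(X)
# Off-diagonal correlations between components are zero up to floating-point noise
off_diag = np.corrcoef(Z, rowvar=False) - np.eye(3)
print(np.abs(off_diag).max())
```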
plt.figure(figsize = (16, 16))
sns.heatmap(data_model.corr(), annot = True, fmt = '0.2f') # Shows correlation between variables.
plt.show()
Observation:
# Importing PCA and t-SNE
from sklearn.decomposition import PCA
# PCA helps with dimensionality reduction.
# Defining the number of principal components to generate
n = data_scaled.shape[1]
# Finding principal components for the data
# Apply the PCA algorithm with random_state = 1
pca = PCA(n_components=n, random_state=1)
# Fit and transform PCA on the scaled data
data_pca1 = pd.DataFrame(pca.fit_transform(data_scaled))
# The percentage of variance explained by each principal component
exp_var = pca.explained_variance_ratio_
# Label a copy of the components PC1..PCn (df.columns would not match the 17 components)
data_pca_copy = data_pca1.set_axis(['PC' + str(i + 1) for i in range(n)], axis=1)
exp_var
array([4.48668019e-01, 1.16533332e-01, 6.94607714e-02, 6.02872096e-02,
5.78382351e-02, 4.16408241e-02, 3.79264953e-02, 3.54636976e-02,
2.76229835e-02, 2.52984813e-02, 2.31137962e-02, 2.30471218e-02,
1.58105413e-02, 1.44767616e-02, 2.81172999e-03, 6.80243099e-33,
1.75106629e-33])
# Visualize the explained variance by individual components
plt.figure(figsize = (10, 10))
plt.plot(range(1, 18), exp_var.cumsum(), marker = 'o', linestyle = '--')
plt.title("Explained Variances by Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.show()
np.cumsum(exp_var)
array([0.44866802, 0.56520135, 0.63466212, 0.69494933, 0.75278757,
0.79442839, 0.83235489, 0.86781858, 0.89544157, 0.92074005,
0.94385385, 0.96690097, 0.98271151, 0.99718827, 1. ,
1. , 1. ])
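The cumulative curve flattens quickly: the first six components already capture roughly 80% of the variance. As a sketch (random data standing in for `data_scaled`, with 0.80 as an assumed target), the smallest number of components reaching a cumulative-variance threshold can be found with `np.argmax`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 17))  # stand-in for the 17-column scaled data

pca = PCA(n_components=X.shape[1], random_state=1).fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)

# np.argmax returns the first index where the condition becomes True
k = int(np.argmax(cum_var >= 0.80)) + 1
print(k, round(cum_var[k - 1], 3))
```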
pc_comps = ['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6']
data_pca = pd.DataFrame(np.round(pca.components_[:6,:], 2), index = pc_comps, columns = data_scaled.columns) # Visualize the PC coefficients.
def color_high(val):
    if val <= -0.40:
        return 'background: pink'
    elif val >= 0.40:
        return 'background: skyblue'
data_pca.T.style.applymap(color_high)
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | |
|---|---|---|---|---|---|---|
| Recency | 0.010000 | 0.000000 | -0.320000 | 0.540000 | -0.750000 | -0.120000 |
| MntWines | 0.300000 | 0.130000 | 0.290000 | -0.050000 | -0.220000 | 0.030000 |
| MntFruits | 0.250000 | -0.150000 | -0.230000 | 0.120000 | 0.280000 | -0.060000 |
| MntMeatProducts | 0.300000 | -0.150000 | 0.060000 | 0.080000 | -0.040000 | 0.330000 |
| MntFishProducts | 0.260000 | -0.160000 | -0.220000 | 0.140000 | 0.260000 | -0.050000 |
| MntSweetProducts | 0.250000 | -0.130000 | -0.220000 | 0.130000 | 0.220000 | -0.020000 |
| MntGoldProds | 0.210000 | 0.100000 | -0.110000 | 0.120000 | 0.160000 | -0.710000 |
| NumDealsPurchases | -0.020000 | 0.550000 | -0.170000 | -0.140000 | -0.020000 | 0.200000 |
| NumWebPurchases | 0.210000 | 0.380000 | -0.060000 | -0.170000 | -0.030000 | -0.300000 |
| NumCatalogPurchases | 0.300000 | -0.020000 | 0.020000 | -0.050000 | -0.110000 | 0.210000 |
| NumStorePurchases | 0.270000 | 0.130000 | -0.190000 | -0.260000 | -0.080000 | 0.100000 |
| NumWebVisitsMonth | -0.200000 | 0.410000 | 0.170000 | 0.250000 | 0.140000 | -0.120000 |
| Expenses | 0.350000 | -0.010000 | 0.130000 | 0.040000 | -0.070000 | 0.070000 |
| NumTotalPurchases | 0.300000 | 0.320000 | -0.140000 | -0.220000 | -0.090000 | 0.070000 |
| Engaged_in_days | 0.060000 | 0.370000 | 0.100000 | 0.610000 | 0.330000 | 0.330000 |
| TotalAcceptedCmp | 0.150000 | -0.000000 | 0.680000 | 0.050000 | -0.090000 | -0.230000 |
| AmountPerPurchase | 0.320000 | -0.120000 | 0.200000 | 0.140000 | -0.030000 | 0.010000 |
Observation and Insights:
# df_model.head()
Think About It:
# Importing clustering algorithms
from sklearn.cluster import KMeans
# K-Means is a distance-based clustering algorithm that aims to minimize the within-cluster sum of squares
# step 1
sse = {}
# step 2 - iterate for a range of Ks and fit the pca components to the algorithm.
for k in range(1, 10):
kmeans = KMeans(n_clusters = k, max_iter = 1000, random_state = 1).fit(data_pca1)
sse[k] = kmeans.inertia_ # Use inertia attribute from the clustering object and store the inertia value for that K
# step 3
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()), 'bx-')
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()
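Reading the elbow off the plot is subjective. One crude way to automate it is to take the second difference of the SSE curve: the sharpest bend produces the largest value. A hedged sketch on toy blob data (results on the real `data_pca1` may differ, and this heuristic can be fooled by a very steep first drop):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for data_pca1
X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

ks = range(1, 8)
inertias = []
for k in ks:
    inertias.append(KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_)

# Second difference of the inertia curve; a large value marks a sharp bend
second_diff = np.diff(inertias, 2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]
print("Suggested elbow at K =", elbow_k)
```

`n_init=10` is set explicitly so the sketch behaves the same across scikit-learn versions, where the default has changed.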
sil_score = [] # Creating empty list
cluster_list = range(3, 7) # Candidate cluster counts: 3 to 6 inclusive
for n_clusters in cluster_list:
# Initialize K-Means with number of clusters equal to n_clusters and random_state=1
clusterer = KMeans(n_clusters=n_clusters, random_state=1)
# Fit and predict on the pca data
preds = clusterer.fit_predict(data_pca1)
# Calculate silhouette score - Hint: Use silhouette_score() function
score = silhouette_score(data_pca1, preds)
# Append silhouette score to empty list created above
sil_score.append(score)
# Print the silhouette score
print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
For n_clusters = 3, the silhouette score is 0.27201384838171144
For n_clusters = 4, the silhouette score is 0.25661259468245723
For n_clusters = 5, the silhouette score is 0.23583061465612465
For n_clusters = 6, the silhouette score is 0.13010333058878457
kmeans = KMeans(n_clusters = 3, random_state = 1)
kmeans.fit(data_pca1)
KMeans(n_clusters=3, random_state=1)
data_pca1["K_means_segments_3"] = kmeans.labels_ # Adding K-Means cluster labels to the data_pca data
df["K_means_segments_3"] = kmeans.labels_ # Adding K-Means cluster labels to the whole data
data_scaled["K_means_segments_3"] = kmeans.labels_ # Adding K-Means cluster labels to data_scaled
silhouette_avg = silhouette_score(data_pca1, kmeans.labels_)
print("The silhouette score for data_pca1 is:", silhouette_avg)
The silhouette score for data_pca1 is: 0.29586573188903226
# List to store each clustering method and its silhouette score
results = []
# Append method and score to the results list
results.append({"Method": 'Kmeans_3', "Score": silhouette_avg})
print(results)
[{'Method': 'Kmeans_3', 'Score': 0.29586573188903226}]
# Function to visualize PCA data with clusters formed
def PCA_PLOT(X, Y, PCA, cluster):
sns.scatterplot(x=X, y=Y, data=PCA, hue=cluster) # Scatter of two PCs colored by cluster label
PCA_PLOT(0, 1, data_pca1, "K_means_segments_3")
cluster_profile_KMeans_3 = data_pca1.groupby('K_means_segments_3').mean()
cluster_profile_KMeans_3.style.highlight_max(color="lightgreen", axis=0)
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments_3 | |||||||||||||||||
| 0 | 0.768588 | 1.374578 | -0.211799 | -0.431943 | -0.153868 | -0.107772 | -0.153745 | -0.156302 | 0.040737 | -0.044713 | -0.005825 | -0.024341 | 0.061141 | -0.020934 | 0.071660 | 0.000000 | 0.000000 |
| 1 | -2.484489 | -0.360640 | 0.076136 | 0.134909 | 0.050476 | 0.014695 | 0.036093 | 0.046186 | 0.005626 | -0.002559 | -0.000593 | -0.000505 | -0.026724 | 0.014590 | -0.035199 | 0.000000 | 0.000000 |
| 2 | 3.772440 | -0.846935 | 0.092243 | 0.225994 | 0.076007 | 0.091663 | 0.102623 | 0.086674 | -0.055443 | 0.054129 | 0.007534 | 0.027816 | -0.017803 | -0.004023 | -0.013654 | 0.000000 | 0.000000 |
# Columns to use in boxplot
# col_for_box = ['Income','Kidhome','Teenhome','Recency','MntWines','MntFruits','MntMeatProducts','MntFishProducts','MntSweetProducts','MntGoldProds','NumDealsPurchases','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumWebVisitsMonth','Complain','Age','Family_Size','Expenses','NumTotalPurchases','Engaged_in_days','TotalAcceptedCmp','AmountPerPurchase']
# Creating boxplot for each of the variables
all_col = ['Income','NumDealsPurchases','NumWebVisitsMonth', 'NumWebPurchases','TotalAcceptedCmp', 'Expenses','NumTotalPurchases']
plt.figure(figsize = (20, 50))
for i, variable in enumerate(all_col):
plt.subplot(6, 4, i + 1)
sns.boxplot(y=df[variable], x=df['K_means_segments_3'],showmeans=True)
plt.tight_layout()
plt.title(variable)
plt.show()
Think About It:
Summary of each cluster:
Cluster 0:
Cluster 1:
Cluster 2:
# Drop columns
data_pca1.drop("K_means_segments_3", axis=1, inplace=True)
df.drop("K_means_segments_3", axis=1, inplace=True)
kmeans = KMeans(n_clusters = 5, random_state = 1)
kmeans.fit(data_pca1)
KMeans(n_clusters=5, random_state=1)
data_pca1["K_means_segments_5"] = kmeans.labels_ # Adding K-Means cluster labels to the data_pca data
df["K_means_segments_5"] = kmeans.labels_ # Adding K-Means cluster labels to the whole data
data_scaled["K_means_segments_5"] = kmeans.labels_ # Adding K-Means cluster labels to data_scaled
silhouette_avg = silhouette_score(data_pca1, kmeans.labels_)
print("The silhouette score for data_pca1 is:", silhouette_avg)
The silhouette score for data_pca1 is: 0.2894335595121166
# Append method and score to the results list
results.append({"Method": 'Kmeans_5', "Score": silhouette_avg})
print(results)
[{'Method': 'Kmeans_3', 'Score': 0.29586573188903226}, {'Method': 'Kmeans_5', 'Score': 0.2894335595121166}]
PCA_PLOT(0, 1, data_pca1, "K_means_segments_5")
cluster_profile_KMeans_5 = data_pca1.groupby('K_means_segments_5').mean()
cluster_profile_KMeans_5.style.highlight_max(color="lightgreen", axis=0)
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_segments_5 | |||||||||||||||||
| 0 | -2.552404 | -0.478570 | 0.081441 | 0.135359 | 0.045405 | 0.002666 | 0.011847 | 0.021778 | 0.010197 | 0.000036 | 0.002878 | -0.007709 | -0.036259 | 0.013451 | -0.046011 | 0.000000 | 0.000000 |
| 1 | 1.578934 | 0.442909 | -0.504456 | -0.600998 | -0.192214 | -0.053760 | -0.457999 | -0.406103 | 0.090076 | -0.030772 | 0.089143 | 0.036990 | -0.091341 | -0.018903 | 0.081260 | 0.000000 | 0.000000 |
| 2 | -0.285200 | 2.021514 | -0.019120 | -0.139724 | -0.035601 | -0.052136 | 0.290267 | 0.183086 | -0.051196 | -0.013738 | -0.073046 | -0.015239 | 0.215441 | -0.020764 | 0.078775 | 0.000000 | 0.000000 |
| 3 | 4.155756 | -0.297457 | 2.260443 | 0.280019 | -0.626976 | -0.155703 | -0.144649 | 0.378139 | 0.248659 | -0.151026 | -0.074191 | 0.021742 | 0.078000 | -0.032695 | -0.060042 | 0.000000 | 0.000000 |
| 4 | 3.850381 | -1.108559 | -0.690979 | 0.342767 | 0.437119 | 0.186212 | 0.284502 | 0.054486 | -0.204137 | 0.123750 | -0.004193 | -0.016664 | -0.046377 | 0.020934 | -0.017435 | 0.000000 | 0.000000 |
# Creating boxplot for each of the variables
plt.figure(figsize = (20, 50))
for i, variable in enumerate(all_col):
plt.subplot(6, 4, i + 1)
sns.boxplot(y=df[variable], x=df['K_means_segments_5'],showmeans=True)
plt.tight_layout()
plt.title(variable)
plt.show()
Cluster 0: Customers in this group have a moderate income and don't frequently take advantage of deals. They visit websites a fair amount and make a moderate to high number of online purchases. They also accept a low number of marketing campaigns. Their total purchases are moderate, and each purchase tends to be for a relatively low to moderate amount.
Cluster 1: Customers in this group have a moderate income and don't frequently take advantage of deals. They visit websites a fair amount and make a moderate to high number of online purchases. They also accept a low number of marketing campaigns. Their total purchases are moderate, and each purchase tends to be for a relatively low to moderate amount.
Cluster 2: With a moderate income, these customers make the most deal purchases. They visit websites frequently and make a moderate number of online purchases. They don't accept many marketing campaigns. Their total number of purchases is moderate, and each purchase tends to be for a relatively low amount.
Cluster 3: These customers have a high income but make the lowest number of purchases when there are deals. They have a low to moderate number of web visits but make a moderate number of web purchases. They accept the highest number of marketing campaigns. Their total number of purchases is moderate, and each purchase tends to be for a moderate to high amount.
Cluster 4: Customers in this group have a high income but make the lowest number of deal purchases. They have the lowest web visitation and make a low to moderate number of web purchases. They don't accept many marketing campaigns. Their total number of purchases is moderate, and each purchase tends to be for a moderate amount.
from sklearn_extra.cluster import KMedoids
# Creating copy of the data to store labels from each algorithm
# data_scaled_copy = data_k_pca.copy(deep = True)
for n_clusters in range(2, 5):
# Initialize K-Medoids with the specified number of clusters
kmedo = KMedoids(n_clusters=n_clusters, random_state=1)
kmedo.fit(data_pca1)
# Predict cluster labels for the data points
df[f'kmedoLabels_{n_clusters}'] = kmedo.predict(data_pca1)
# Calculate silhouette score
silhouette_avg = silhouette_score(data_pca1, kmedo.labels_)
results.append({"Method": f'Medioids_{n_clusters}', "Score": silhouette_avg})
print(f"The silhouette score for data_pca1 with {n_clusters} clusters is: {silhouette_avg}")
The silhouette score for data_pca1 with 2 clusters is: 0.3127386542698528
The silhouette score for data_pca1 with 3 clusters is: 0.12521679142543043
The silhouette score for data_pca1 with 4 clusters is: 0.11421820199844311
# Print silhouette scores
print(results)
[{'Method': 'Kmeans_3', 'Score': 0.29586573188903226}, {'Method': 'Kmeans_5', 'Score': 0.2894335595121166}, {'Method': 'Medioids_2', 'Score': 0.3127386542698528}, {'Method': 'Medioids_3', 'Score': 0.12521679142543043}, {'Method': 'Medioids_4', 'Score': 0.11421820199844311}]
Observation:
plt.figure(figsize=(20, 20))
for i, col in enumerate(all_col):
plt.subplot(6, 4, i + 1)
sns.boxplot(x='kmedoLabels_2', y=col, data=df)
plt.tight_layout()
plt.title(col)
plt.show()
Observation:
plt.figure(figsize=(20, 20))
for i, col in enumerate(all_col):
plt.subplot(6, 4, i + 1)
sns.boxplot(x='kmedoLabels_3', y=col, data=df)
plt.tight_layout()
plt.title(col)
plt.show()
Observation:
Observations and Insights: Overall, the clusters do not provide many insights, as there is not enough separation between them.
Cluster 0: Highest income, highest number of web visits
Cluster 1: Similar to cluster 2
Cluster 2: Too similar to cluster 1
Observations and Insights: The K-Medoids clusters are not very insightful.
# Drop columns
data_pca1.drop("K_means_segments_5", axis=1, inplace=True)
df.drop("K_means_segments_5", axis=1, inplace=True)
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average"]
high_cophenet_corr = 0 # Creating a variable by assigning 0 to it
high_dm_lm = [0, 0] # Creating a list by assigning 0's to it
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(data_pca1, metric=dm, method=lm) # Applying different linkages with different distance on data_pca
c, coph_dists = cophenet(Z, pdist(data_pca1)) # Calculating cophenetic correlation
print(
"Cophenetic correlation for {} distance and {} linkage is {}.".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c: # Checking if this cophenetic correlation is higher than the previous best
high_cophenet_corr = c # Storing the new highest correlation
high_dm_lm[0] = dm # Storing its corresponding distance metric
high_dm_lm[1] = lm # Storing its corresponding linkage method
Cophenetic correlation for Euclidean distance and single linkage is 0.8129807227587553.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7537524321636122.
Cophenetic correlation for Euclidean distance and average linkage is 0.8151761018960105.
Cophenetic correlation for Chebyshev distance and single linkage is 0.8051749754976828.
Cophenetic correlation for Chebyshev distance and complete linkage is 0.6813736478845351.
Cophenetic correlation for Chebyshev distance and average linkage is 0.774784181804776.
Cophenetic correlation for Mahalanobis distance and single linkage is 0.5376203582123958.
Cophenetic correlation for Mahalanobis distance and complete linkage is 0.3693232898614882.
Cophenetic correlation for Mahalanobis distance and average linkage is 0.5881693054344688.
Cophenetic correlation for Cityblock distance and single linkage is 0.8149845035005294.
Cophenetic correlation for Cityblock distance and complete linkage is 0.7627696622102742.
Cophenetic correlation for Cityblock distance and average linkage is 0.8411010987060117.
# Printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.8411010987060117, which is obtained with Cityblock distance and average linkage.
# List of linkage methods
linkage_methods = ["single", "complete", "average"]
# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30)) # Setting the plot size
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(data_pca1, metric="cityblock", method=method) # Applying each linkage with cityblock (Manhattan) distance on data_pca1
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)") # Title of dendrogram
coph_corr, coph_dist = cophenet(Z, pdist(data_pca1)) # Finding cophenetic correlation for different linkages with city block distance
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
Think about it:
# List of linkage methods
linkage_methods = ["single", "complete", "average"]
# Lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# To create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30)) # Setting the plot size
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(data_pca1, metric="chebyshev", method=method) # Measures the distances between two clusters
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)") # Title of dendrogram
coph_corr, coph_dist = cophenet(Z, pdist(data_pca1)) # Finding cophenetic correlation for different linkages with Chebyshev distance
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
# Show the plot
plt.tight_layout()
plt.show()
Complete linkage generates better separated clusters.
plt.figure(figsize = (20, 7))
plt.title("Dendrograms")
dend = dendrogram(linkage(data_scaled, method = 'complete'))
plt.axhline(y = 15, color = 'r', linestyle = '--')
<matplotlib.lines.Line2D at 0x15a50e09970>
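The horizontal cut drawn at y = 15 can be converted into flat cluster labels with SciPy's `fcluster`. A minimal sketch, with toy data standing in for `data_scaled`:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Toy data standing in for data_scaled
X, _ = make_blobs(n_samples=150, centers=4, random_state=1)

Z = linkage(X, method="complete")
# Cut the dendrogram at the same distance as the y = 15 reference line
labels = fcluster(Z, t=15, criterion="distance")
print("Clusters at threshold 15:", np.unique(labels).size)
```

On the real data, this yields the same grouping as reading cluster membership off the dendrogram below the dashed line.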
# data_pca1 = data_pca1.drop('dendrogram', axis=1)
# data_pca1.head()
HCmodel = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='ward')
HCmodel.fit(data_pca1)
AgglomerativeClustering(affinity='euclidean', n_clusters=4)
data_pca1["dendrogram"] = HCmodel.labels_ # Adding hierarchical cluster labels to the data_pca data
df["dendrogram"] = HCmodel.labels_ # Adding hierarchical cluster labels to the whole data
data_scaled["dendrogram"] = HCmodel.labels_ # Adding hierarchical cluster labels to data_scaled
silhouette_avg = silhouette_score(data_pca1, HCmodel.labels_)
print("The silhouette score for data_pca1 is:", silhouette_avg)
The silhouette score for data_pca1 is: 0.22837691444799138
# Append method and score to the results list
results.append({"Method": 'Dendrogram', "Score": silhouette_avg})
print(results)
[{'Method': 'Kmeans_3', 'Score': 0.29586573188903226}, {'Method': 'Kmeans_5', 'Score': 0.2894335595121166}, {'Method': 'Medioids_2', 'Score': 0.3127386542698528}, {'Method': 'Medioids_3', 'Score': 0.12521679142543043}, {'Method': 'Medioids_4', 'Score': 0.11421820199844311}, {'Method': 'Dendrogram', 'Score': 0.22837691444799138}]
PCA_PLOT(0, 1, data_pca1, "dendrogram")
# Creating boxplot for each of the variables
plt.figure(figsize = (20, 50))
for i, variable in enumerate(all_col):
plt.subplot(6, 4, i + 1)
sns.boxplot(y=df[variable], x=df['dendrogram'],showmeans=True)
plt.tight_layout()
plt.title(variable)
plt.show()
df.dendrogram.value_counts()
1    928
0    591
2    450
3    239
Name: dendrogram, dtype: int64
Observations and Insights:
Summary of each cluster:
Cluster 0:
Cluster 1:
Cluster 2:
Cluster 3:
DBSCAN is a powerful algorithm for finding high-density clusters, but the challenge is choosing its hyperparameters: eps and min_samples.
Because clustering is unsupervised, there is no validation set to tune against, as there would be for a supervised learning algorithm. A practical approach is to try a number of different hyperparameter combinations and compute the silhouette score for each.
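Besides grid search, a common heuristic for choosing eps is the k-distance graph: sort every point's distance to its k-th nearest neighbor (with k equal to the intended min_samples) and look for the knee. A sketch with toy data standing in for `data_pca1`; the knee detection here is a deliberately crude largest-jump rule:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs

# Toy data standing in for data_pca1
X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
k = 20  # matches one of the min_samples values tried below

nn = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = nn.kneighbors(X)          # shape (n_samples, k)
k_distances = np.sort(distances[:, -1])  # sorted distance to the k-th neighbor

# Crude knee estimate: the largest jump in the sorted curve
knee_idx = int(np.argmax(np.diff(k_distances)))
eps_candidate = float(k_distances[knee_idx])
print("eps candidate:", round(eps_candidate, 2))
```

In practice the sorted curve is plotted and the knee read off by eye; the largest-jump rule is only a rough stand-in for that visual inspection.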
data_pca1.drop("dendrogram", axis=1, inplace=True)
# Initializing lists
eps_value = [2,3] # Candidate eps values to try
min_sample_values = [6,20] # Candidate min_samples values to try
# Creating a dictionary mapping each eps value to the list of min_samples values
res = {eps_value[i]: min_sample_values for i in range(len(eps_value))}
from sklearn.cluster import DBSCAN
# Finding the silhouette_score for each of the combinations
high_silhouette_avg = 0 # Assigning 0 to the high_silhouette_avg variable
high_i_j = [0, 0] # Assigning 0's to the high_i_j list
key = res.keys() # Assigning dictionary keys to a variable called key
for i in key:
z = res[i] # Assigning dictionary values of each i to z
for j in z:
db = DBSCAN(eps=i, min_samples=j).fit(data_pca1) # Applying DBSCAN to each of the combination in dictionary
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
silhouette_avg = silhouette_score(data_pca1, labels) # Finding silhouette score
print(
"For eps value =" + str(i),
"For min sample =" + str(j),
"The average silhouette score is :",
silhouette_avg, # Printing the silhouette score for each of the combinations
)
if high_silhouette_avg < silhouette_avg: # Keep the combination with the highest silhouette score seen so far
high_i_j[0] = i
high_i_j[1] = j
high_silhouette_avg = silhouette_avg
For eps value =2 For min sample =6 The average silhouette score is : 0.07631293391537533
For eps value =2 For min sample =20 The average silhouette score is : 0.3445089447463355
For eps value =3 For min sample =6 The average silhouette score is : 0.3307327349278307
For eps value =3 For min sample =20 The average silhouette score is : 0.34151752330706703
# Printing the highest silhouette score
print("Highest silhouette average is {:.2f} for eps = {} and min_samples = {}".format(high_silhouette_avg, high_i_j[0], high_i_j[1]))
Highest silhouette average is 0.34 for eps = 2 and min_samples = 20
# Append method and score to the results list
results.append({"Method": 'DBSCAN', "Score": high_silhouette_avg})
print(results)
[{'Method': 'Kmeans_3', 'Score': 0.29586573188903226}, {'Method': 'Kmeans_5', 'Score': 0.2894335595121166}, {'Method': 'Medioids_2', 'Score': 0.3127386542698528}, {'Method': 'Medioids_3', 'Score': 0.12521679142543043}, {'Method': 'Medioids_4', 'Score': 0.11421820199844311}, {'Method': 'Dendrogram', 'Score': 0.22837691444799138}, {'Method': 'DBSCAN', 'Score': 0.3445089447463355}]
dbs = DBSCAN(eps=2, min_samples=20) # Best combination from the search above (eps = 2, min_samples = 20)
labels_dbs = dbs.fit_predict(data_pca1) # Fitting DBSCAN and getting cluster labels (-1 marks noise points)
data_pca1["DBSCAN_cluster"] = labels_dbs # Adding DBSCAN cluster labels to the data_pca data
df["DBSCAN_cluster"] = labels_dbs # Adding DBSCAN cluster labels to the whole data
data_scaled["DBSCAN_cluster"] = labels_dbs # Adding DBSCAN cluster labels to data_scaled
PCA_PLOT(0, 1, data_pca1, "DBSCAN_cluster")
Acceptable distribution with some overlap near the right side.
# Creating boxplot for each of the variables
# all_col = ['Income','Teenhome', 'Kidhome', 'Age', 'Family_Size', 'Expenses', 'NumTotalPurchases', 'Engaged_in_days', 'TotalAcceptedCmp', 'AmountPerPurchase']
plt.figure(figsize = (20, 50))
for i, variable in enumerate(all_col):
plt.subplot(6, 4, i + 1)
sns.boxplot(y=df[variable], x=df['DBSCAN_cluster'],showmeans=True)
plt.tight_layout()
plt.title(variable)
plt.show()
Observations and Insights:
Think about it:
Summary of each cluster:
Cluster 0:
Cluster 1:
Cluster 2:
Cluster 3:
Cluster 4:
data_pca1 = data_pca1.drop('DBSCAN_cluster', axis=1)
# data_pca1 = data_pca1.drop('GMM_cluster', axis=1)
from sklearn.mixture import GaussianMixture
for n_components in range(3, 7):
gmm = GaussianMixture(n_components=n_components, random_state=1)
preds = gmm.fit_predict(data_pca1)
score = silhouette_score(data_pca1, preds)
print(f"Number of Components: {n_components}")
print(f"Silhouette Score: {score}")
results.append({"Method": f"GMM_{n_components}", "Score": score})
Number of Components: 3
Silhouette Score: 0.16256857703931887
Number of Components: 4
Silhouette Score: 0.15384146936782728
Number of Components: 5
Silhouette Score: 0.12086634023827815
Number of Components: 6
Silhouette Score: 0.06504147539555859
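The silhouette score favors compact, well-separated clusters, which can understate how well a GMM fits overlapping, elliptical components. GMM's native model-selection criterion is BIC (lower is better). A hedged sketch with toy data standing in for `data_pca1`:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Toy data standing in for data_pca1
X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# BIC for each candidate component count; lower BIC is better
bics = {}
for n in range(2, 7):
    gmm = GaussianMixture(n_components=n, random_state=1).fit(X)
    bics[n] = gmm.bic(X)

best_n = min(bics, key=bics.get)
print("Components with lowest BIC:", best_n)
```

BIC and silhouette can legitimately disagree; comparing both before fixing the component count is a reasonable safeguard.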
# Append a summary 'GMM' entry to the results list (note: the score here is from the last loop iteration, n_components = 6)
results.append({"Method": 'GMM', "Score": score})
print(results)
[{'Method': 'Kmeans_3', 'Score': 0.29586573188903226}, {'Method': 'Kmeans_5', 'Score': 0.2894335595121166}, {'Method': 'Medioids_2', 'Score': 0.3127386542698528}, {'Method': 'Medioids_3', 'Score': 0.12521679142543043}, {'Method': 'Medioids_4', 'Score': 0.11421820199844311}, {'Method': 'Dendrogram', 'Score': 0.22837691444799138}, {'Method': 'DBSCAN', 'Score': 0.3445089447463355}, {'Method': 'GMM_3', 'Score': 0.16256857703931887}, {'Method': 'GMM_4', 'Score': 0.15384146936782728}, {'Method': 'GMM_5', 'Score': 0.12086634023827815}, {'Method': 'GMM_6', 'Score': 0.06504147539555859}, {'Method': 'GMM', 'Score': 0.06504147539555859}]
Observations and Insights:
gmm = GaussianMixture(n_components=4, random_state=1)
preds = gmm.fit_predict(data_pca1)
data_pca1["GMM_cluster"] = preds # Adding GMM cluster labels to the data_pca data
df["GMM_cluster"] = preds # Adding GMM cluster labels to the whole data
data_scaled["GMM_cluster"] = preds # Adding GMM cluster labels to data_scaled
df["GMM_cluster"].value_counts()
0    750
2    601
1    443
3    414
Name: GMM_cluster, dtype: int64
PCA_PLOT(0, 1, data_pca1, "GMM_cluster")
# Creating boxplot for each of the variables
plt.figure(figsize = (20, 50))
for i, variable in enumerate(all_col):
plt.subplot(6, 4, i + 1)
sns.boxplot(y=df[variable], x=df['GMM_cluster'],showmeans=True)
plt.tight_layout()
plt.title(variable)
plt.show()
Observation: Clusters overlap within the box plots.
Cluster 0:
Cluster 1:
Cluster 2:
Cluster 3:
Observations and Insights:
Summary of each cluster:
# Convert the results list to a DataFrame
results_df = pd.DataFrame(results)
styled_results = results_df.style.background_gradient(subset=["Score"], cmap="Greens")
# Print the results DataFrame
display(styled_results)
| | Method | Score |
|---|---|---|
| 0 | Kmeans_3 | 0.295866 |
| 1 | Kmeans_5 | 0.289434 |
| 2 | Medioids_2 | 0.312739 |
| 3 | Medioids_3 | 0.125217 |
| 4 | Medioids_4 | 0.114218 |
| 5 | Dendrogram | 0.228377 |
| 6 | DBSCAN | 0.344509 |
| 7 | GMM_3 | 0.162569 |
| 8 | GMM_4 | 0.153841 |
| 9 | GMM_5 | 0.120866 |
| 10 | GMM_6 | 0.065041 |
| 11 | GMM | 0.065041 |
Technique Evaluation:
Room to improve:
1. The DBSCAN model is recommended for adoption: it achieves the highest silhouette score, produces the best-separated clusters, and provides meaningful insights.
Original questions:
prefered_model_stats = df.groupby('DBSCAN_cluster')['NumTotalPurchases'].count()
total_counts = prefered_model_stats.sum()
prefered_model_stats = prefered_model_stats.to_frame(name='CustomerCount')
prefered_model_stats['Percentage of Total Customers'] = prefered_model_stats['CustomerCount'] / total_counts * 100
formatted_prefered_model_stats = prefered_model_stats.round().astype(int)
styled_prefered_model_stats = formatted_prefered_model_stats.style.background_gradient(cmap="RdYlGn")
styled_prefered_model_stats.set_caption("Customer Count")
| | CustomerCount | Percentage of Total Customers |
|---|---|---|
| DBSCAN_cluster | ||
| 0 | 969 | 44 |
| 1 | 401 | 18 |
| 2 | 347 | 16 |
| 3 | 159 | 7 |
| 4 | 332 | 15 |
df.groupby('DBSCAN_cluster')[['Expenses', 'NumTotalPurchases']].mean()
| | Expenses | NumTotalPurchases |
|---|---|---|
| DBSCAN_cluster | ||
| 0 | 87.796698 | 7.507740 |
| 1 | 867.887781 | 21.683292 |
| 2 | 518.135447 | 19.285303 |
| 3 | 1659.962264 | 21.301887 |
| 4 | 1404.900602 | 20.756024 |
prefered_model_stats = df.groupby('DBSCAN_cluster')[['Expenses', 'NumTotalPurchases']].sum()
prefered_model_stats['AmountPerPurchase'] = prefered_model_stats['Expenses'] / prefered_model_stats['NumTotalPurchases']
formatted_prefered_model_stats = prefered_model_stats.round().astype(int)
formatted_prefered_model_stats = formatted_prefered_model_stats.rename(columns={"Expenses": "Revenue", "NumTotalPurchases": "Total Purchases"})
# Calculate the percent of total for each column
total_revenue = formatted_prefered_model_stats["Revenue"].sum()
total_purchases = formatted_prefered_model_stats["Total Purchases"].sum()
formatted_prefered_model_stats["PercentRevenue"] = (formatted_prefered_model_stats["Revenue"] / total_revenue * 100).round().astype(int)
formatted_prefered_model_stats["PercentPurchases"] = (formatted_prefered_model_stats["Total Purchases"] / total_purchases * 100).round().astype(int)
styled_prefered_model_stats = formatted_prefered_model_stats.style.background_gradient(cmap="RdYlGn")
styled_prefered_model_stats.set_caption("Customer Total Sum")
| | Revenue | Total Purchases | AmountPerPurchase | PercentRevenue | PercentPurchases |
|---|---|---|---|---|---|
| DBSCAN_cluster | |||||
| 0 | 85075 | 7275 | 12 | 6 | 22 |
| 1 | 348023 | 8695 | 40 | 26 | 26 |
| 2 | 179793 | 6692 | 27 | 13 | 20 |
| 3 | 263934 | 3387 | 78 | 20 | 10 |
| 4 | 466427 | 6891 | 68 | 35 | 21 |
prefered_model_stats_per_customer = df.groupby('DBSCAN_cluster')[['Expenses', 'NumTotalPurchases']].mean()
prefered_model_stats_per_customer['AmountPerPurchase'] = prefered_model_stats_per_customer['Expenses'] / prefered_model_stats_per_customer['NumTotalPurchases']
formatted_prefered_model_stats_per_customer = prefered_model_stats_per_customer.round().astype(int)
styled_prefered_model_stats_per_customer = formatted_prefered_model_stats_per_customer.style.background_gradient(cmap="RdYlGn")
styled_prefered_model_stats_per_customer.set_caption("Purchase Stats per Customer")
| | Expenses | NumTotalPurchases | AmountPerPurchase |
|---|---|---|---|
| DBSCAN_cluster | |||
| 0 | 88 | 8 | 12 |
| 1 | 868 | 22 | 40 |
| 2 | 518 | 19 | 27 |
| 3 | 1660 | 21 | 78 |
| 4 | 1405 | 21 | 68 |
In order to develop a better understanding of our ideal customers, it is important not to immediately jump to prescriptions based solely on data analysis. Instead, we should engage in conversations with the customers and cashiers to gain deeper insights. By actively talking to them, we can uncover the contextual nuances and obtain firsthand information about their preferences and behaviors. This approach will allow us to bridge the gap between the data and real-life experiences, enabling the development of a more informed and effective marketing strategy tailored to the needs and desires of our target audience.
3. Refined insights:
Customer profile 0: Represents the largest number of customers, making up 44% of the total customer base.
Customer profile 1: Represents the highest number of purchases, accounting for 26% of all purchases. The overall number of purchases appears to be more consistent compared to revenue.
Customer profile 2: Demonstrates moderate characteristics across various metrics.
Customer profile 3: Exhibits the highest average purchase amount but represents the smallest percentage of customers, only 7%. They make the fewest number of purchases.
Customer profile 4: Contributes the highest percentage of revenue, accounting for 35%. They have a moderate number of purchases and a high average purchase amount.
Cluster 0:
Cluster 1:
Cluster 2:
Cluster 3:
Cluster 4:
We should redirect marketing efforts away from cluster 0 and toward clusters 1 and 4, which produce more revenue.
The marketing campaign dataset contains 2,240 customers and tracks 27 features. After dropping rows with missing income, which accounted for 1% of the data, exploratory data analysis (EDA) was conducted. Some outliers were removed, followed by feature engineering, in which related columns were aggregated into new variables. Principal Component Analysis (PCA) was applied to reduce collinearity between variables and provide more meaningful clustering inputs.
Multiple clustering models were built and evaluated using silhouette scores and insights from box plots. The DBSCAN technique outperformed the others with a score of 0.34. DBSCAN is a density-based clustering algorithm that identifies clusters of arbitrary shape and handles noise and outliers in the data. Its highest score indicates that it captured the underlying structure more effectively than the other techniques and produced meaningful clusters. Based on an analysis of the DBSCAN clusters, it is recommended to redirect marketing efforts from cluster 0 to clusters 1 and 4, which have shown greater potential for generating revenue. Reallocating marketing efforts toward these more lucrative segments will result in improved resource allocation and better marketing outcomes.
Improvement suggestions include enhancing feature engineering techniques, effectively handling outliers, and consulting industry and company experts for valuable insights to enhance the clustering process. Additionally, engaging in conversations with customers and cashiers is recommended to gain a better understanding of ideal customers. By bridging the gap between data analysis and real-life experiences, we can develop a more informed and effective marketing strategy that is tailored to the needs and desires of the target audience.
The final solution design proposal suggests adopting the DBSCAN model because it achieved the highest silhouette score and provides meaningful insights. The model captures the underlying structure, identifies clusters of arbitrary shape, and handles noise and outliers effectively. Evaluation and adjustments should continue iteratively to improve the model's performance and refine the analysis, considering the impact on revenue and costs.
The problem at hand is to identify different segments within the existing customer base of a marketing campaign dataset. This involves analyzing the customers' spending patterns and past interactions with the company to gain insights into their behavior and preferences.
Reason for the proposed solution design: The proposed solution design suggests using the DBSCAN model. This choice is based on the model's high silhouette score, indicating its ability to capture meaningful clusters within the data. DBSCAN is known for its capability to identify clusters of arbitrary shape and effectively handle noise or outliers in the dataset.
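The model-selection logic described here (compare candidate clusterers on the same data by silhouette score) can be sketched as follows. The data and hyperparameters are illustrative stand-ins, not the notebook's actual values.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with four separated groups, standing in for the real features
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [0, 6], [6, 0]],
    cluster_std=0.8,
    random_state=1,
)

# Candidate models; hyperparameters are illustrative, not tuned values
models = {
    "KMeans": KMeans(n_clusters=4, n_init=10, random_state=1),
    "Agglomerative": AgglomerativeClustering(n_clusters=4),
    "DBSCAN": DBSCAN(eps=1.0, min_samples=5),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    mask = labels != -1        # drop DBSCAN noise before scoring
    if len(set(labels[mask])) > 1:
        scores[name] = silhouette_score(X[mask], labels[mask])
print(scores)
```

In the notebook's evaluation this comparison favored DBSCAN (0.34); on this toy data the ranking may differ, since separability and noise levels drive the scores.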
Key Points of the Proposed Solution Design:
Potential impact on the problem/business: The proposed solution design will provide valuable insights into the customer base and their segmentation. It will address key questions such as the different customer segments and their average spending, the key features that describe each segment, and which segments should be focused on or avoided. By understanding the customer segments and their behavior, businesses can tailor their marketing strategies, products, and services to better meet the needs and preferences of each segment. This targeted approach can lead to increased customer satisfaction, improved marketing campaign effectiveness, and potentially higher revenue. The solution's validity lies in its use of a reliable clustering algorithm, evaluation based on the Silhouette score, and the iterative refinement process to ensure the best possible results.
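Segment-level questions such as "what is each segment's average spending?" reduce to a groupby over the cluster labels. The sketch below uses hypothetical columns (`cluster`, `total_spend`, `num_purchases`) with random values standing in for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical cluster labels and spend columns (illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cluster": rng.integers(0, 3, size=120),
    "total_spend": rng.gamma(2.0, 200.0, size=120),
    "num_purchases": rng.poisson(8, size=120),
})

# One row per segment: size, average spend, average purchase count
profile = df.groupby("cluster").agg(
    customers=("cluster", "size"),
    avg_spend=("total_spend", "mean"),
    avg_purchases=("num_purchases", "mean"),
).round(1)
print(profile)
```

A table like `profile` is what backs recommendations such as shifting effort from cluster 0 toward higher-revenue segments.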
Key Recommendations to Implement the Solution:
Key Actionables for Stakeholders:
Benefits of the Solution:
Costs and Benefits Assumptions
Key Risks and Challenges:
Further Analysis and Associated Problems: